It is widely acknowledged that structured nonparametric modeling that circumvents the curse of
dimensionality is important in nonparametric estimation. In this paper we show that the same
holds for semi-parametric estimation. We argue that estimation of the parametric component
of a semi-parametric model can be improved substantially when more structure is put into the
nonparametric part of the model. We illustrate this for the partially linear model, and
investigate efficiency gains when the nonparametric part of the model has an additive structure.
We present the semi-parametric Fisher information bound for estimating the parametric part of
the partially linear additive model and provide semi-parametric efficient estimators for which
we use a smooth backfitting technique to deal with the additive nonparametric part. We also
present the finite sample performance of the proposed estimators and analyze Boston housing
data as an illustration.

Keywords: partially linear additive models; profile estimator; semi-parametric efficiency; 
smooth backfitting 

1. Introduction 

Structured nonparametric models such as additive models are known to circumvent the
curse of dimensionality and allow reliable estimation when a full nonparametric model
does not work. In the present paper we show that a similar assertion applies for
semi-parametric models: structural modeling of the nonparametric part can lead to accurate
estimation of the parametric part even in situations where otherwise only very poor,
unreliable or unstable estimates would be available. We show this by comparing the
partially linear and the partially linear additive model. In particular, we demonstrate that
using an additive model for the nonparametric part in the partially linear model can lead
to drastic gains of efficiency in the estimation of the parametric components. This holds
if the dimension of the nonparametric covariates is high, or the parametric covariates
can be approximated by non-additive transformations of the nonparametric covariates.
In the extreme of the latter case, if the approximation is exact, then estimation of the
parametric part in the partially linear model breaks down. If the approximation is very
crude, one sees large efficiency gains by using additive models for the nonparametric part.

Suppose we observe i.i.d. copies $(Y^1, X^1, Z^1), \ldots, (Y^n, X^n, Z^n)$ of a random
vector $(Y, X, Z)$, where $X = (X_1, \ldots, X_p)^T \in \mathbb{R}^p$ and
$Z = (Z_1, \ldots, Z_d)^T \in \mathbb{R}^d$. The partially linear model assumes

$$Y = m_0 + X^T\beta + m(Z_1, \ldots, Z_d) + \varepsilon, \qquad (1)$$

where $\beta$ is an unknown $p$-vector and $m$ is an unknown $d$-variate function. The partially
linear additive model puts an additive structure on the nonparametric function $m$:

$$Y = m_0 + X^T\beta + m_1(Z_1) + \cdots + m_d(Z_d) + \varepsilon. \qquad (2)$$

These models exclude the interesting case where $X$ or $Z$ includes some endogenous
variables of $Y$, but they simplify our discussion on semi-parametric efficiency. We believe
that our results can be extended to the corresponding semi-parametric models with time
series data by following, for example, the arguments in [7].

For identifiability of the additive component functions $m_j$, we put the constraints
$E m_j(Z_j) = 0$, $1 \le j \le d$. We assume that $(X, Z)$ has a joint density $q$ with respect to
$\nu = \nu_1 \times \nu_2$, where $\nu_1$ is a $\sigma$-finite measure and $\nu_2$ is the Lebesgue
measure on the supports of $X$ and $Z$, respectively, and that the marginal density of $Z$ (with
respect to $\nu_2$), denoted by $q_Z$, has compact support, say $[0,1]^d$. The model (2) enjoys
the advantages of both the partially linear model (1) and the nonparametric additive model over
the fully nonparametric model. It accommodates discrete covariates, since we only require that
$\nu_1$ is a $\sigma$-finite measure, and also interaction effects between covariates by putting
them into the parametric part. Through the additive structure in the nonparametric part it avoids
the curse of dimensionality but retains the flexibility of the model. It also allows easy
interpretation of the individual role of each covariate.

We discuss semi-parametric efficient estimation of the parameter $\beta$ in the model (2).
We present the semi-parametric Fisher information bound and provide an estimator that
achieves the efficiency bound. Semi-parametric efficient estimation when $d = 1$ has been
studied by Bhattacharya and Zhao [1], Cuzick [5] and Schick [17]. Their works can be
easily extended to the model (1) for $d > 1$. Comparing the Fisher information bounds
for the models (1) and (2), we find that the Fisher information under the model (2) is
at least as large as that under the model (1), so that the minimal attainable asymptotic
variance under the model (2) is no larger than under the model (1). In our semi-parametric
model (2), we do not specify the distribution of the error term $\varepsilon$ or the
distribution $q$ of the covariates. We show that one can do as well without knowing those
distributions.

There have been a few works on the model (2). Opsomer and Ruppert [13] obtained
a $\sqrt{n}$-consistent estimator of $\beta$ by a backfitting method with undersmoothing.
Recently, Liang et al. [8] and Carroll et al. [4] studied the model with measurement error and
repeated measurements, respectively, but they did not discuss semi-parametric efficiency.
The model (1) has been studied more often; see [19], among others. Most studies, however,
focus on the cases where there is only a single-dimensional (or at most low-dimensional)
nonparametric function $m$. This is because a high dimension requires higher-order
smoothness in theory and leads to poor small-sample performance in practice.
2. Semi-parametric efficiency

To avoid unnecessary complexity, we assume $m_0 = 0$. We also assume that $\varepsilon$ is
independent of $(X, Z)$, and that $g$, the density of $\varepsilon$, is symmetric and is
absolutely continuous with respect to the Lebesgue measure, having a derivative $g'$ and finite
Fisher information $\int (g')^2/g < \infty$. Below, we give a heuristic argument for deriving the
semi-parametric efficiency and present a rigorous statement in a theorem.

Suppose that $g$ is known and $p = 1$. We write $m(z) = m_1(z_1) + \cdots + m_d(z_d)$ and
adopt the convention $m_j(z) = m_j(z_j)$. The logarithm of the joint density of $(Y, X, Z)$
as a function of the parameters is given by
$\ell(\beta, m; (y, x, z)) = \log g(y - x\beta - m(z))$, neglecting those terms that do not
depend on $(\beta, m)$, and the log-likelihood of $(\beta, m)$ by
$\sum_{i=1}^n \ell(\beta, m; (Y^i, X^i, Z^i))$. Let $\mathcal{H}$ denote the space of all
additive functions $m$ such that $m(z) = m_1(z_1) + \cdots + m_d(z_d)$, $E m_j(Z_j) = 0$ and
$E m(Z)^2 < \infty$.

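Calculation of the Fisher information in a semi-parametric model is made locally: fix
a value $(\beta^0, m^0)$ of the parameter $(\beta, m)$ and think of all 'regular' parametric
submodels $\{(\beta, m_\beta) : \beta \in \mathbb{R}\}$ passing through $(\beta^0, m^0)$, where
$m_{\beta^0} = m^0$ and the mapping $\beta \mapsto m_\beta$ is Fréchet differentiable as a
function from $\mathbb{R}$ to $\mathcal{H}$. Define $\psi = g'/g$. Then, each finite-dimensional
submodel $\{(\beta, m_\beta) : \beta \in \mathbb{R}\}$ has the score function

$$\frac{d\ell(\beta, m_\beta)}{d\beta}\Big|_{\beta=\beta^0}
= \frac{\partial \ell(\beta, m^0)}{\partial\beta}\Big|_{\beta=\beta^0}
+ \frac{\partial \ell(\beta^0, m)}{\partial m}\Big|_{m=m^0}(\delta)
= -\psi(\varepsilon)X - \psi(\varepsilon)\delta(Z),$$

where $\delta = dm_\beta/d\beta|_{\beta=\beta^0} \in \mathcal{H}$ is the tangent of the mapping
$\beta \mapsto m_\beta$ at $\beta^0$ and $\partial\ell/\partial m$ denotes the Fréchet derivative
of $\ell$ with respect to $m$. This gives the Fisher information for estimating $\beta$ in each
submodel as $I(\delta) = E[\psi(\varepsilon)X + \psi(\varepsilon)\delta(Z)]^2$.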

The Fisher information at $(\beta^0, m^0) \in \mathbb{R} \times \mathcal{H}$ in the full
semi-parametric model typically equals the Fisher information at
$(\beta^0, m^0) \in \mathbb{R} \times \mathcal{H}$ in the most difficult parametric
submodel, that is, the one that gives minimal $I(\delta)$. Theorem 1 below demonstrates that this
is the case with our problem. The least favorable direction $\delta^*$ that minimizes $I(\delta)$
over $\delta \in \mathcal{H}$ is the solution of the following integral equation: for all
$\delta \in \mathcal{H}$,

$$0 = E[\psi(\varepsilon)X + \psi(\varepsilon)\delta^*(Z)]\psi(\varepsilon)\delta(Z)
= I_g \cdot E[(E(X|Z) + \delta^*(Z))\delta(Z)],$$

where $I_g = \int (g')^2/g$. This shows that $\delta^* = -\Pi(E(X|Z=\cdot)|\mathcal{H})$, where
$\Pi(\cdot|\mathcal{H})$ denotes the projection operator onto $\mathcal{H}$, and that the 'curve'
$m^*_\beta$ corresponding to the least favorable submodel equals
$m^*_\beta = (\beta^0 - \beta)\Pi(E(X|Z=\cdot)|\mathcal{H}) + m^0$. The Fisher information for the
least favorable submodel is thus given by
$I(\delta^*) = I_g \cdot E[X - \Pi(E(X|Z)|\mathcal{H})]^2$, where, with a slight abuse of
notation, we write $\Pi(E(X|Z=\cdot)|\mathcal{H})(Z) = \Pi(E(X|Z)|\mathcal{H})$.
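
The minimization behind the least favorable direction can be written out as a short worked derivation for $p = 1$. It is only a sketch: it uses the independence of $\varepsilon$ and $(X, Z)$ and the orthogonality of the two projections involved, with all symbols as defined above.

$$
\begin{aligned}
I(\delta) &= E[\psi(\varepsilon)^2]\, E[X + \delta(Z)]^2
 = I_g \{ E[X - E(X|Z)]^2 + E[E(X|Z) + \delta(Z)]^2 \} \\
 &= I_g \{ E[X - E(X|Z)]^2 + E[E(X|Z) - \Pi(E(X|Z)|\mathcal{H})]^2
 + E[\Pi(E(X|Z)|\mathcal{H}) + \delta(Z)]^2 \}.
\end{aligned}
$$

Only the last term depends on $\delta \in \mathcal{H}$, and it vanishes at $\delta^* = -\Pi(E(X|Z=\cdot)|\mathcal{H})$, which leaves $I(\delta^*) = I_g \cdot E[X - \Pi(E(X|Z)|\mathcal{H})]^2$. The middle term is the amount by which this information exceeds $I_g \cdot E[X - E(X|Z)]^2$.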

The above arguments can be generalized to the case where $p > 1$. Writing
$\eta_j = \Pi(E(X_j|Z=\cdot)|\mathcal{H})$ and $\eta = (\eta_1, \ldots, \eta_p)^T$, the least
favorable direction equals $\delta^* = -\eta$, so that the Fisher information matrix for the
least favorable submodel equals $I(\delta^*) = I_g \cdot E[X - \eta(Z)][X - \eta(Z)]^T$. In the
following theorem we show that the Fisher information $I(\delta^*)$ given above is indeed the
semi-parametric information bound, as defined in [3], in our original semi-parametric model
where the error density $g$ and the density $q$ of the covariate $(X, Z)$ are not specified. To
state the theorem, let $\mathcal{G}$ denote the set of all symmetric and absolutely continuous
(with respect to the Lebesgue measure) functions $g$ such that $I_g < \infty$. Let $\mathcal{Q}$
be an arbitrary class of density functions $q$. For the spaces of $m$, we consider Hilbert spaces
defined by

$$\mathcal{H}(q) = \Big\{ m \in L_2(q) : m(z) = \sum_{j=1}^d m_j(z_j) \text{ and }
E_q m_j(Z_j) = 0 \text{ for all } 1 \le j \le d \Big\},$$

where $L_2(q)$ denotes the space of functions $m : \mathbb{R}^d \to \mathbb{R}$ such that
$E_q m(Z)^2 < \infty$ and $E_q$ means the expectation under the density $q$. The semi-parametric
model (2) under study is then expressed as
$\mathcal{P} = \{p(\cdot; \beta, m, g, q) : \beta \in \mathbb{R}^p, m \in \mathcal{H}(q),
g \in \mathcal{G}, q \in \mathcal{Q}\}$. Let $(\beta^0, m^0, g_0, q_0)$ be a fixed point where we
are calculating the semi-parametric Fisher information. Denote by $P_0$ the distribution
corresponding to $(\beta^0, m^0, g_0, q_0)$, and by $I(P_0|\beta, \mathcal{P})$ the
semi-parametric Fisher information at $P_0$ for estimating $\beta$ under the model $\mathcal{P}$.
In the theorem below, the 'efficient score' $\ell^*$ for estimating $\beta$ is the score for
$\beta$ at $\beta^0$ in the least favorable parametric submodel that is indexed only by $\beta$
and passes through $P_0$. Let $E_0$ denote the expectation under $P_0$.

Theorem 1. The efficient score at $P_0$ for estimating $\beta$ is given by

$$\ell^*(x, z, y; P_0|\beta, \mathcal{P}) = -[x - \eta(z)]\,\psi(y - x^T\beta^0 - m^0(z)),$$

where $\eta = (\Pi[E_0(X_j|Z=\cdot)|\mathcal{H}(q_0)])_{j=1}^p$. The information bound at $P_0$
for estimating $\beta$ equals
$I(P_0|\beta, \mathcal{P}) = I_{g_0} \cdot E_0[X - \eta(Z)][X - \eta(Z)]^T$.

A proof of Theorem 1 can be found in an extended version of this paper that can be
downloaded from http://stat.snu.ac.kr/theostat/papers/BEJ296_ExtendedVersion.pdf.

Let $\mathcal{P}_{PL} \supset \mathcal{P}$ denote the semi-parametric model (1). One can show
$I(P_0|\beta, \mathcal{P}_{PL}) = I_{g_0} \cdot E_0[X - E_0(X|Z)][X - E_0(X|Z)]^T$ using the
arguments to derive $I(P_0|\beta, \mathcal{P})$. Note that
$I(P_0|\beta, \mathcal{P}) \ge I(P_0|\beta, \mathcal{P}_{PL})$ by the property of conditional
expectation, and that the equality $I(P_0|\beta, \mathcal{P}) = I(P_0|\beta, \mathcal{P}_{PL})$
holds if $E_0(X_j|Z=z)$ are additive for all $1 \le j \le p$. According to the theory of
semi-parametric efficiency, the minimal asymptotic variance that any regular estimator of $\beta$
can achieve equals the inverse of the Fisher information matrix. The inequality
$I(P_0|\beta, \mathcal{P}) \ge I(P_0|\beta, \mathcal{P}_{PL})$ implies
$I(P_0|\beta, \mathcal{P})^{-1} \le I(P_0|\beta, \mathcal{P}_{PL})^{-1}$, with equality holding
if $E_0(X_j|Z=z)$ are all additive.

Theorem 2. Suppose $I(P_0|\beta, \mathcal{P}_{PL})$ is positive definite. Then,
$I(P_0|\beta, \mathcal{P})^{-1} < I(P_0|\beta, \mathcal{P}_{PL})^{-1}$ unless
$E_0[\eta(Z) - E_0(X|Z)][\eta(Z) - E_0(X|Z)]^T = O$, where $O$ is the $p \times p$ matrix with all
entries being zero, and $A < B$ means that $B - A$ is non-negative definite and $A \ne B$.
Theorem 2 shows that using an additive model for the nonparametric part can lead
to drastic gains of efficiency in the estimation of the parametric components. The
efficiency gains occur if the parametric covariates $X$ are approximated by non-additive
transformations of the nonparametric covariates $Z$. If the approximation is exact, then
estimation of the parametric part in the partially linear model (1) breaks down since
$I(P_0|\beta, \mathcal{P}_{PL}) = O$, while it does not with the partially linear additive
model (2). If the approximation is very crude, one has large efficiency gains by using additive
models for the nonparametric part.

3. Semi-parametric efficient estimation 

Let $\beta^0$ and $m^0$ denote the true parameter values. In this section we present the
semi-parametric efficient estimator of $\beta^0$ that achieves the minimal asymptotic variance
$I(P_0|\beta, \mathcal{P})^{-1}$. The construction is based on a smooth backfitting technique and
a profiling method. The latter is basically for estimating the least favorable curve, and is
applied to the Gaussian error model to produce an initial estimator of $\beta^0$ to be used in
the construction of the semi-parametric efficient estimator.

3.1. Smooth backfitting methods 

The smooth backfitting method, introduced by Mammen, Linton and Nielsen [10], is 
known to be a powerful technique for estimating additive regression functions. Since our 
profiling method involves smooth backfitting for non-additive functions, we discuss some 
properties of the method when the target function is not additive. 

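Let $W$ be a random variable and $\{W^i\}_{i=1}^n$ be a random sample distributed as $W$. The
smooth backfitting estimator
$\hat m^{\mathrm{add}}_W(z) = \hat m^{\mathrm{add}}_{W,0} + \hat m^{\mathrm{add}}_{W,1}(z_1) +
\cdots + \hat m^{\mathrm{add}}_{W,d}(z_d)$, with responses $W^i$ and regressors $Z^i$, is defined
as the solution of the following integral equations:

$$\hat m^{\mathrm{add}}_{W,j} = \hat m_{W,j}
- \sum_{l=1,\, l \ne j}^{d} \hat\Pi_j(\hat m^{\mathrm{add}}_{W,l})
- \hat m^{\mathrm{add}}_{W,0}, \qquad 1 \le j \le d, \qquad (3)$$

with the constraints $\langle \hat m^{\mathrm{add}}_{W,j}, 1 \rangle = 0$ for $1 \le j \le d$.
Here, $\hat m^{\mathrm{add}}_{W,0} = n^{-1}\sum_{i=1}^n W^i$ and $\hat m_{W,j}(z_j)$
denotes the marginal regression kernel estimator obtained by regressing $W^i$ on $Z^i_j$ only.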

The operator $\hat\Pi_j$ stands for a projection onto a Hilbert space equipped with a scalar
product $\langle \cdot, \cdot \rangle$; see [23] for details. For example, in the case where
$\hat m_{W,j}(z_j)$ are the local constant marginal estimators,
$\langle g, h \rangle = \int g(z)h(z)\hat q_Z(z)\,dz$, with $\hat q_Z(\cdot)$ being the kernel
estimator of the design density $q_Z$. Smoothing in the direction of $Z_j$ is done by the
boundary corrected kernel $K_{h_j}(u, v) = c_j(v)h_j^{-1}K^0((u - v)/h_j)$, where $K^0$ is a base
kernel function, $h_j$ is the bandwidth, and $c_j(v)$ is a factor that gives
$\int K_{h_j}(u, v)\,du = 1$.
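
The boundary correction can be illustrated with a small sketch. It assumes the Epanechnikov kernel as the base kernel $K^0$ and the unit interval as the support; the function names are hypothetical and chosen only for this illustration. The factor `c_v` is computed from the closed-form integral of the base kernel, and a midpoint rule then checks numerically that the corrected kernel integrates to one in `u`.

```python
def epanechnikov(t):
    """Base kernel K^0: Epanechnikov density on [-1, 1]."""
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0


def epanechnikov_cdf(t):
    """Integral of the Epanechnikov kernel from -1 up to t."""
    if t <= -1.0:
        return 0.0
    if t >= 1.0:
        return 1.0
    return 0.5 + 0.75 * t - 0.25 * t ** 3


def boundary_kernel(u, v, h):
    """K_h(u, v) = c(v) * K^0((u - v) / h) / h, normalised in u over [0, 1]."""
    # Mass of the uncorrected kernel centred at v that falls inside [0, 1].
    mass = epanechnikov_cdf((1.0 - v) / h) - epanechnikov_cdf((0.0 - v) / h)
    c_v = 1.0 / mass
    return c_v * epanechnikov((u - v) / h) / h


def integrate_over_u(v, h, n_grid=20000):
    """Midpoint-rule approximation of the integral of K_h(u, v) over u in [0, 1]."""
    step = 1.0 / n_grid
    return sum(boundary_kernel((k + 0.5) * step, v, h) for k in range(n_grid)) * step


for v in (0.0, 0.03, 0.5, 1.0):
    print(v, round(integrate_over_u(v, h=0.1), 6))
```

In the interior the factor equals one, so the correction only changes the kernel within one bandwidth of the boundary.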

Let $m_W(z) = E(W|Z=z)$. We do not assume that $m_W$ is an additive function. Define
$m^{\mathrm{add}}_W = m^{\mathrm{add}}_{W,1} + \cdots + m^{\mathrm{add}}_{W,d}$ to be the
projection of $m_W$ onto the space of additive functions $\mathcal{H}(q_Z)$. Then,
$E[m_W(Z) - E(W) - m^{\mathrm{add}}_W(Z)]\delta(Z) = 0$ for any $\delta \in \mathcal{H}(q_Z)$. The
additive function $m^{\mathrm{add}}_W(z)$ plays the role of the target function that the smooth
backfitting estimator $\hat m^{\mathrm{add}}_W(z)$ aims at. Lu et al. [9] discussed the property
of the smooth backfitting estimators under non-additive regression models in the context of
spatial data analysis. However, they treated only the case where the bandwidth is asymptotic to
$n^{-1/5}$. Below, we give a uniform expansion of the smooth backfitting estimator for a wider
range of the bandwidths, after tedious asymptotic calculation following the lines of the
arguments in [10]. To state the theorem, let $e = W - E(W) - m^{\mathrm{add}}_W(Z)$ and define
$e^i$ accordingly. Let $\hat m_{e,j}(z_j)$ and $\tilde m_{e,j}(z_j)$ denote, respectively, the
local constant and local linear estimators with responses $e^i$ and the scalar regressors
$Z^i_j$. Let $h_j$ be the bandwidth associated with $Z_j$. The theorem relies on the following
assumptions.

Assumptions A. 

A1. For $1 \le j \ne k \le d$, the densities $q_{Z_j,Z_k}$ are bounded away from zero and
infinity on their support, $[0,1]^2$, and have continuous partial derivatives.
A2. The base kernel function $K^0$ is symmetric, has compact support and has a bounded
derivative.
A3. The functions $m^{\mathrm{add}}_{W,j}$ are twice continuously differentiable.
A4. $E|W - m_W(Z)|^{r_0} < \infty$ for some $r_0 > 5/2$.

Theorem 3. Assume that the conditions A1-A4 hold, and that $h_j$ are asymptotic to
$n^{-\alpha}$ for $1/5 \le \alpha < 1/2$. Then, for $1 \le j \le d$, it holds that

$$\sup_{z_j \in [0,1]} |\hat m^{\mathrm{add}}_{W,j}(z_j) - m^{\mathrm{add}}_{W,j}(z_j)
- h_j\alpha_{1,j,n}(z_j) - h_j^2\alpha_{2,j}(z_j) - \hat m_{e,j}(z_j)| = o_p((nh_j)^{-1/2})$$

in the local constant case, and that

$$\sup_{z_j \in [0,1]} |\hat m^{\mathrm{add}}_{W,j}(z_j) - m^{\mathrm{add}}_{W,j}(z_j)
- h_j^2\tilde\alpha_{2,j}(z_j) - \tilde m_{e,j}(z_j)| = o_p((nh_j)^{-1/2})$$

in the local linear case, for some functions $\alpha_{1,j,n}$ that are uniformly bounded and
non-zero only for $z_j \in [0, ch_j) \cup (1 - ch_j, 1]$ for some constant $0 < c < \infty$, and
for some functions $\alpha_{2,j}$ and $\tilde\alpha_{2,j}$ that are continuous.

A proof of Theorem 3 can be found in an extended version of this paper that can be
downloaded from http://stat.snu.ac.kr/theostat/papers/BEJ296_ExtendedVersion.pdf.

3.2. Profiling with Gaussian error models 

We apply a profiling technique to remove the infinite-dimensional parameter $m$ in the
estimation of $\beta$. For a general framework of profiling approaches to semi-parametric
models, we refer to [18]. See also [12] for a more recent work on profile likelihood.
Define $\hat m^{\mathrm{add}}_X = (\hat m^{\mathrm{add}}_{X_1}, \ldots,
\hat m^{\mathrm{add}}_{X_p})^T$. We note that $\hat m^{\mathrm{add}}_X$ is an estimator of $\eta$
and $\hat m^{\mathrm{add}}_Y$ is an estimator of $\beta^{0T}\eta + m^0$. For each given $\beta$,
let $\hat m^{\mathrm{add}}(z; \beta) = \sum_{j=1}^d \hat m^{\mathrm{add}}_j(z_j; \beta)$ be the
smooth backfitting estimator obtained by taking
$Y^i - X^{iT}\beta = X^{iT}(\beta^0 - \beta) + m^0(Z^i) + \varepsilon^i$ as responses and $Z^i$ as
covariates. Recall that the least favorable curve is given by
$m^*(\cdot, \beta) = \eta^T(\beta^0 - \beta) + m^0$. Thus, we may regard
$\hat m^{\mathrm{add}}(\cdot; \beta)$ as an estimator of the least favorable curve
$m^*(\cdot, \beta)$. Since
$\hat m^{\mathrm{add}}(z; \beta) = \hat m^{\mathrm{add}}_Y(z) - \hat m^{\mathrm{add}}_X(z)^T\beta$
by the fact that the smooth backfitting operation is linear in response vectors, the estimated
profile likelihood based on the Gaussian error model is given by

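$$-\sum_{i=1}^n [Y^i - X^{iT}\beta - \hat m^{\mathrm{add}}(Z^i; \beta)]^2
= -\sum_{i=1}^n [\tilde Y^i - \tilde X^{iT}\beta]^2,$$

up to an additive constant and a positive scale factor. The estimator that maximizes the above
Gaussian profile likelihood is then given by

$$\hat\beta = \Big(\sum_{i=1}^n \tilde X^i \tilde X^{iT}\Big)^{-1}
\Big(\sum_{i=1}^n \tilde X^i \tilde Y^i\Big),$$

where $\tilde X^i = X^i - \hat m^{\mathrm{add}}_X(Z^i)$ and
$\tilde Y^i = Y^i - \hat m^{\mathrm{add}}_Y(Z^i)$.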
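
The partialling-out structure of this estimator can be sketched in a toy case with $p = 1$ and $d = 1$. The sketch below is not the smooth backfitting procedure of the paper: for a single nonparametric covariate it swaps in a plain Nadaraya-Watson smoother, and the data-generating model, the bandwidth and all function names are assumptions made only for this illustration.

```python
import math
import random


def nw_smooth(z_eval, z_obs, w_obs, h):
    """Nadaraya-Watson (local constant) estimate of E(W | Z = z_eval)."""
    num = den = 0.0
    for z, w in zip(z_obs, w_obs):
        t = (z_eval - z) / h
        if abs(t) < 1.0:
            k = 0.75 * (1.0 - t * t)  # Epanechnikov weight
            num += k * w
            den += k
    return num / den


def profile_estimate(y, x, z, h):
    """Gaussian profile estimate of beta for scalar X and scalar Z."""
    # Partial out Z from both the parametric covariate and the response.
    x_tilde = [xi - nw_smooth(zi, z, x, h) for xi, zi in zip(x, z)]
    y_tilde = [yi - nw_smooth(zi, z, y, h) for yi, zi in zip(y, z)]
    # Least squares of y_tilde on x_tilde (the p = 1 case of the closed form).
    sxx = sum(a * a for a in x_tilde)
    sxy = sum(a * b for a, b in zip(x_tilde, y_tilde))
    return sxy / sxx


def simulate(n, beta, seed):
    """Toy data (assumed): Y = beta * X + sin(2 pi Z) + eps, with X depending on Z."""
    rng = random.Random(seed)
    z = [rng.random() for _ in range(n)]
    x = [math.cos(2 * math.pi * zi) + rng.gauss(0.0, 1.0) for zi in z]
    y = [beta * xi + math.sin(2 * math.pi * zi) + rng.gauss(0.0, 0.5)
         for xi, zi in zip(x, z)]
    return y, x, z


y, x, z = simulate(n=300, beta=1.5, seed=0)
beta_hat = profile_estimate(y, x, z, h=0.1)
print(round(beta_hat, 3))
```

For $d > 1$ the two calls to `nw_smooth` would be replaced by an additive fit such as smooth backfitting, while the final least squares step keeps the same form.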

Theorem 4. Suppose that the assumptions A1-A4 hold with $W = Y$ and $X_j$, $1 \le j \le p$.
Also, assume that $E[\exp(|X_j - E(X_j|Z)|)|Z] < C$ a.s. for some $C > 0$, $1 \le j \le p$. If the
bandwidths $h_j$ are asymptotic to $n^{-\alpha}$ for $1/5 \le \alpha < 1/2$, then it holds that

$$\sqrt{n}(\hat\beta - \beta^0) \xrightarrow{d}
N\big(0, \mathrm{var}(\varepsilon)[E(X - \eta(Z))(X - \eta(Z))^T]^{-1}\big).$$

A proof of Theorem 4 is given in the Appendix. We note that the asymptotic variance
of the estimator $\hat\beta$ is at least as large as $I(P_0|\beta, \mathcal{P})^{-1}$. This can
be seen directly from a projection property. In fact, $\mathrm{var}(\varepsilon) \ge I_g^{-1}$
and the equality holds if $g$ is Gaussian. This means that the estimator $\hat\beta$ achieves the
semi-parametric efficiency in the reduced model where $g$ is specified as a Gaussian density. It
is also interesting to see what happens if $\eta_0(X, Z) = E_0(Y|X, Z)$ does not belong to the
partially linear additive model of the form (2). In this case, our estimator of $\eta_0$
converges to $\eta^*$, which is the $L_2(q)$-projection of $\eta_0$ onto the space

$$\mathcal{F} = \{f \in L_2(q) \mid f(x, z) = \beta^T x + m(z),\ \beta \in \mathbb{R}^p,\
m \in \mathcal{H}\}. \qquad (4)$$



3.3. Adapting to unknown error density 

In this subsection, we construct the semi-parametric efficient estimator that achieves the 
minimal asymptotic variance discussed in Section 2. We follow the approach adopted 
by Bickel [2], Schick [16, 17], Park [14], Cuzick [5] and Bhattacharya and Zhao [1]. 
Write $I = I(P_0|\beta, \mathcal{P})$ and define
$\beta^*_n = \beta^0 - I^{-1}n^{-1}\sum_{i=1}^n [X^i - \eta(Z^i)]\psi(\varepsilon^i)$. Then, the
random sequence $\beta^*_n$ achieves the efficiency bound. We plug some estimators of the
unknown quantities into $\beta^*_n$. We estimate the error density $g$ by using the 'pseudo'
errors $\hat\varepsilon^i = Y^i - X^{iT}\hat\beta - \hat m^{\mathrm{add}}(Z^i; \hat\beta)$, where
$\hat\beta$ is the Gaussian profile estimator constructed in Section 3.2. In particular, we take
$\hat g(t) = b + (na)^{-1}\sum_{i=1}^n L((t - \hat\varepsilon^i)/a)$ and
$\hat g'(t) = d\hat g(t)/dt$, where $a$ and $b$ are positive constants that depend on the sample
size $n$, and $L$ is a symmetric differentiable density function. Define

$$\hat I = n^{-1}\sum_{i=1}^n \tilde X^i \tilde X^{iT} \hat\psi(\hat\varepsilon^i)^2,$$

where $\hat\psi$ is the 'symmetrized' estimator of $\psi$ defined by
$\hat\psi(e) = [(\hat g'/\hat g)(e) - (\hat g'/\hat g)(-e)]/2$.
Our semi-parametric efficient estimator is then given by

$$\tilde\beta = \hat\beta - \hat I^{-1} n^{-1}\sum_{i=1}^n \tilde X^i \hat\psi(\hat\varepsilon^i).$$
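
The score estimate used in this adaptation step can be sketched as follows. The sketch assumes a standard normal density for the kernel $L$ and uses simulated standard normal draws as a stand-in for the pseudo errors; the values of `a` and `b` and the function names are illustrative assumptions, not choices made in the paper.

```python
import math
import random


def gauss_kernel(t):
    """Kernel L: standard normal density (an assumed choice)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def density_and_derivative(t, residuals, a, b):
    """Return (g_hat(t), g_hat'(t)) for g_hat(t) = b + (n a)^{-1} sum L((t - e_i) / a)."""
    n = len(residuals)
    g = b
    dg = 0.0
    for e in residuals:
        u = (t - e) / a
        k = gauss_kernel(u)
        g += k / (n * a)
        # d/dt L((t - e) / a) = L'(u) / a, and L'(u) = -u L(u) for the normal kernel.
        dg += -u * k / (n * a * a)
    return g, dg


def score_hat(e, residuals, a, b):
    """Symmetrised score estimate [(g'/g)(e) - (g'/g)(-e)] / 2."""
    g_pos, dg_pos = density_and_derivative(e, residuals, a, b)
    g_neg, dg_neg = density_and_derivative(-e, residuals, a, b)
    return 0.5 * (dg_pos / g_pos - dg_neg / g_neg)


rng = random.Random(1)
pseudo_errors = [rng.gauss(0.0, 1.0) for _ in range(500)]  # stand-in for residuals
for e in (0.5, 1.0, 1.5):
    print(e, round(score_hat(e, pseudo_errors, a=0.5, b=0.01), 3))
```

The symmetrisation makes the estimate an odd function by construction, which matches the symmetry assumed for $g$; the constant `b` keeps the denominator away from zero in the tails.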



Assumptions B. 

B1. The error $\varepsilon$ has an absolutely continuous and symmetric density $g$ with respect
to the Lebesgue measure, $\mu$, and $I_g = \int (g'^2/g)\,d\mu < \infty$.
B2. The kernel $L$ is a symmetric density function with three bounded and Lipschitz
continuous derivatives.
B3. The sequences $a$ and $b$ converge to zero, as $n \to \infty$, and satisfy
$n^{1/2}h_j b(a^2 \wedge b^2) \to \infty$ and $a^2/\{h_j(\log n)^2\} \to \infty$ for all
$1 \le j \le d$.

Theorem 5. Assume that the conditions of Theorem 4 and the assumptions B1-B3 hold. Then,
$\sqrt{n}(\tilde\beta - \beta^0) \xrightarrow{d} N(0, I(P_0|\beta, \mathcal{P})^{-1})$.

A proof of Theorem 5 is given in the Appendix. For a choice of the bandwidth $a$ in
$\hat g$, one can devise a data-driven choice along the lines of Park [15]. For $h$, one can
follow the approach of Mammen and Park [11]. In this adaptation step, misspecification of the
model may result in a meaningless estimator. This is in contrast to the estimation in
the initial step where the procedure estimates the projection of the mean function onto
the model space $\mathcal{F}$ at (4). The reason is that the residuals from the initial step
include not only the pure errors but also the deviation of the true regression function from its
projection onto $\mathcal{F}$. These residuals mislead estimation of the score function.

4. Numerical properties

We generated 500 random samples of size $n = 400$. We used the Epanechnikov kernel for
the regression and the Gaussian density kernel for the estimation of the score function.
We applied a local constant version of smooth backfitting. We took
$m_1(z_1) = \sin\{2\pi(z_1 - 0.5)\}$ and $m_2(z_2) = z_2 - 0.5 + \sin\{2\pi(z_2 - 0.5)\}$. We set
$m_0 = 3$, $\beta_1 = 1.5$ and $\beta_2 = 0.8$. We drew $(Z_1, Z_2)$ from
$N_2((0.5, 0.5)^T, \Sigma)$ truncated to $[0,1]^2$, where
$\Sigma = \{(1 - \rho)I + \rho 11^T\}/4$. We generated $X_1 = CZ_1(1 - 2Z_2) + U$ for some
constant $C$, where $U \sim N(0, 0.5)$, and $X_2$ from $\mathrm{Bernoulli}(p(X_1, Z_1, Z_2))$,
where $p(X_1, Z_1, Z_2) = g(\exp((Z_1 + Z_2)/2) + \sin(2\pi Z_1) - X_1)$ and
$g(t) = \exp(t)/(1 + \exp(t))$. Note that $E(X_1|Z=\cdot)$ is orthogonal to the space of
additive functions.
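
The simulation design for $p = d = 2$ can be sketched as a data generator. Several details are assumptions of this sketch rather than statements of the paper: the truncation is implemented by rejection sampling, $N(0, 0.5)$ is read as having variance 0.5, the errors are standard normal, and the function names are hypothetical.

```python
import math
import random


def draw_z(rng, rho):
    """Draw (Z1, Z2) from N((0.5, 0.5), Sigma) truncated to [0, 1]^2 by rejection.

    Sigma = {(1 - rho) I + rho 1 1^T} / 4, i.e. variances 1/4 and covariance rho / 4.
    """
    while True:
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = 0.5 + 0.5 * e1
        z2 = 0.5 + 0.5 * (rho * e1 + math.sqrt(1.0 - rho * rho) * e2)
        if 0.0 <= z1 <= 1.0 and 0.0 <= z2 <= 1.0:
            return z1, z2


def logistic(t):
    """g(t) = exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(-t))


def simulate_design(n, c, rho, seed, m0=3.0, beta=(1.5, 0.8)):
    """Generate n tuples (Y, X1, X2, Z1, Z2) with N(0, 1) errors."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z1, z2 = draw_z(rng, rho)
        # Assumption: N(0, 0.5) is read as variance 0.5.
        x1 = c * z1 * (1.0 - 2.0 * z2) + rng.gauss(0.0, math.sqrt(0.5))
        prob = logistic(math.exp((z1 + z2) / 2.0) + math.sin(2.0 * math.pi * z1) - x1)
        x2 = 1 if rng.random() < prob else 0
        m1 = math.sin(2.0 * math.pi * (z1 - 0.5))
        m2 = z2 - 0.5 + math.sin(2.0 * math.pi * (z2 - 0.5))
        y = m0 + beta[0] * x1 + beta[1] * x2 + m1 + m2 + rng.gauss(0.0, 1.0)
        rows.append((y, x1, x2, z1, z2))
    return rows


sample = simulate_design(n=400, c=1.0, rho=0.0, seed=2024)
print(len(sample), sample[0])
```

Repeating the call with 500 different seeds would give the replications described above.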

We compared the Gaussian profile estimator (SAM), given in Section 3.2, and the
profile kernel estimator (PL), given in [19], which is for the partially linear model without
the additive structure. For this, we generated $\varepsilon$ from $N(0, 1)$ and set $\rho = 0$.
In the case where $p = 1$, that is, $X_2$ does not enter the model, the theoretical value of the
ratio of the asymptotic variance of SAM to that of PL equals $1/(1 + 0.1707C^2)$. The empirical
values from our simulation study for the bandwidth pair $(h_1, h_2)$ that gave the best mean
square error (MSE) were 0.7818, 0.5868 and 0.4082 for $C = 1$, 2 and 3, respectively, which nearly
coincided with the theoretical values. We tried other values of $\rho$, but the lesson was the
same. In the case where $p = 2$ and $d = 5$ with $(Z_1, \ldots, Z_5)$ from
$N_5((0.5, \ldots, 0.5)^T, \Sigma)$ truncated to $[0,1]^5$ and $m_j(z_j) = z_j^2$ for
$3 \le j \le 5$, we took $C = 1$ and found that SAM beat PL for all bandwidth choices that we
tried. The Gaussian profile estimator was stable while PL broke down for small bandwidths. The
best MSE of SAM and that of PL, respectively, over the various choices of the bandwidth pair
$(h_1, h_2)$ were 0.0032 and 0.0051 for $\beta_1$ and 0.0186 and 0.0269 for $\beta_2$.
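
The comparison between the theoretical and empirical variance ratios for $p = 1$ can be reproduced with a few lines of arithmetic. The sketch only evaluates the formula $1/(1 + 0.1707C^2)$ and sets it next to the empirical values quoted above; the function name is hypothetical.

```python
def theoretical_ratio(c):
    """Asymptotic variance of SAM divided by that of PL, as given in the text."""
    return 1.0 / (1.0 + 0.1707 * c ** 2)


empirical = {1: 0.7818, 2: 0.5868, 3: 0.4082}  # values reported in the text
for c, emp in empirical.items():
    theo = theoretical_ratio(c)
    print(f"C={c}: theoretical={theo:.4f}, empirical={emp:.4f}, gap={emp - theo:+.4f}")
```

A ratio below one means that SAM has the smaller asymptotic variance, and the ratio decreases as $C$ grows.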

Next, we compared SAM with the semi-parametric efficient estimator (ASAM). For
this, we considered the case where $p = d = 2$, $C = 1$ and $\rho = 0.8$, and generated
$\varepsilon$ from $N(0, 1)$, the $t$-distribution with 3 degrees of freedom, and
$\frac{1}{2}N(-1.5, 0.6^2) + \frac{1}{2}N(1.5, 0.6^2)$. For ASAM, we took $b = 0.01$, and six
different choices of $a$: $a_i = 0.3 + 0.1i$, $0 \le i \le 5$, for $N(0, 1)$ and $t(3)$ errors and
$a_i = 0.1 + 0.1i$, $0 \le i \le 5$, for the Gaussian mixture error.
We used 36 different choices for the bandwidth pair
$(h_1, h_2) \in \{0.05, 0.10, \ldots, 0.30\}^2$.
Figure 1 is for the estimators of $\beta_1$. Each box-plot was obtained from the 36 values of
MSE that corresponded to the 36 bandwidth pairs $(h_1, h_2)$. For ASAM, the value of $a$
is indicated on the horizontal scale. The figure suggests that the values of the MSE of
ASAM are far smaller than those of SAM for the entire range of the bandwidth $a$, under
$t(3)$ and the Gaussian mixture error models. The box-plots for the Gaussian error model
are not given here since SAM and ASAM gave similar performance. The results for $\beta_2$
are not reported either since they give a similar lesson.

5. Boston housing data 

We applied the semi-parametric efficient estimators to Boston housing data as an
illustration. As in [6, 22], we took the median price in 1,000 USD (MEDV) as the response
$Y$. Also, we chose as covariates $X_1$, $X_2$ and $Z_1, \ldots, Z_6$, respectively, the eight
variables LSTAT (percentage values of lower status population), CHAS (a dummy variable that
takes the value 1 if the tract borders Charles River; 0 otherwise), CRIM (per capita crime
rate), RM (average numbers of rooms per dwelling), NOX (nitric oxides concentration),
PTRATIO (pupil-teacher ratios), DIS (weighted distances to five Boston employment
centers) and TAX (full-value property tax rate per 10,000 USD). The logarithms of
LSTAT, DIS and TAX were taken to reduce sparse areas, as in [22]. We chose the model
$Y = m_0 + \beta_1 X_1 + \beta_2 X_2 + \sum_{j=1}^6 m_j(Z_j) + \varepsilon$. In the data set,
there were 16 cases for which $Y$ took the maximal value 50. These may be censored responses that
one may remove from analysis. Indeed, an initial analysis showed a strong asymmetry in the
distribution of the residuals, which led us to exclude the 16 cases from further analysis. For
additive regression, we applied local constant smooth backfitting with the Epanechnikov kernel
and bandwidths $h_j$ chosen by a rule of thumb.

With SAM, we obtained $\hat\beta_1 = -6.203$ and $\hat\beta_2 = 0.985$. Their estimated standard
errors were 0.420 and 0.597, respectively. This suggests that $\beta_2$ is not strongly
significant while $\beta_1$ is. The generalized $R^2$ was 0.862. For ASAM, in the estimation of
the score function, we used a bandwidth $a$ that was obtained by the R function bw.SJ(). With
ASAM, we got $\tilde\beta_1 = -6.172$ and $\tilde\beta_2 = 1.366$, and their estimated standard
errors were 0.399 and 0.567, respectively. Thus, with ASAM, both the estimated coefficients are
strongly significant. This may be an indication that a Gaussian error model is not appropriate
for the data set. The generalized $R^2$ was almost the same as in the analysis with SAM.
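
The significance statements can be checked from the reported estimates and standard errors. The sketch below assumes a normal approximation for the ratio of estimate to standard error and a two-sided test; this is an assumption made for illustration, and the paper does not state which test it used.

```python
import math


def wald(estimate, std_error):
    """Return the z-ratio and a two-sided p-value under a normal approximation."""
    z = estimate / std_error
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Estimates and standard errors as reported in the text.
fits = {
    "SAM beta_1": (-6.203, 0.420),
    "SAM beta_2": (0.985, 0.597),
    "ASAM beta_1": (-6.172, 0.399),
    "ASAM beta_2": (1.366, 0.567),
}
for name, (est, se) in fits.items():
    z, p = wald(est, se)
    print(f"{name}: z = {z:.2f}, p = {p:.4f}")
```

Under this approximation the coefficient of CHAS moves from a z-ratio of about 1.65 with SAM to about 2.41 with ASAM, which is the change described above.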






K. Yu, E. Mammen and B. U. Park 



Appendix 

Proof of Theorem 4. We only treat the case with local constant smooth backfitting.
The case with local linear smooth backfitting can be dealt with similarly. We prove

$$n^{-1/2}\sum_{i=1}^n \tilde X^i[\tilde Y^i - \tilde X^{iT}\beta^0]
- n^{-1/2}\sum_{i=1}^n [X^i - \eta(Z^i)]\varepsilon^i = o_p(1). \qquad (5)$$

Write $\Delta(z) = m^0(z) - \hat m^{\mathrm{add}}(z; \beta^0)$. The left-hand side of equation
(5) equals $C_1 + C_2 + C_3$, where
$C_1 = n^{-1/2}\sum_{i=1}^n (X^i - \eta(Z^i))\Delta(Z^i)$,
$C_2 = n^{-1/2}\sum_{i=1}^n (\eta(Z^i) - \hat m^{\mathrm{add}}_X(Z^i))\varepsilon^i$ and
$C_3 = n^{-1/2}\sum_{i=1}^n (\eta(Z^i) - \hat m^{\mathrm{add}}_X(Z^i))\Delta(Z^i)$.
Write $\Delta(z) = \Delta_0 + \sum_{j=1}^d \Delta_j(z_j)$. By Theorem 3, standard techniques of
kernel smoothing, integration by parts and the representation of $m^0$ and
$\hat m^{\mathrm{add}}(z; \beta^0)$ as a solution of an integral equation with differentiable
kernel (see equation (3)), we have

$$\sup_{z \in [0,1]^d} |\Delta(z)| = o_p(\delta_n), \qquad
\sup_{z_j \in [0,1]} |\,\cdots\, h_j b_{n,j}(z_j)| = o_p(\delta_n)$$

for some uniformly bounded non-random functions $b_{n,j}$, where $\delta_n = n^{-a}$ for some
$a \in (0, 1/2 - \alpha)$. These imply that $\delta_n^{-1}\Delta \in B(0,1)$ with probability
tending to one, where $B(0,1)$ denotes a class of additive functions $\sum_{j=1}^d g_j(z_j)$ such
that each $g_j$ is a real function defined on $[0,1]$ and satisfies
$|g_j(t) - g_j(t')| \le |t - t'|$ for all $t, t' \in [0,1]$. The covering number
with bracketing of $B(0,1)$ with respect to the sup-norm,
$N_{[\cdot]}(\eta) = N_{[\cdot]}(\eta, B(0,1), \|\cdot\|_\infty)$,
is bounded by $(2\eta^{-1})^d 3^{d\eta^{-1}}$. Define the random functional
$F(X^i_j, Z^i) : B(0,1) \to \mathbb{R}$ by
$[F(X^i_j, Z^i)](g) = (X^i_j - \eta_j(Z^i))g(Z^i)$, and $F_j : B(0,1) \to \mathbb{R}$ by
$F_j = n^{-1/2}\sum_{i=1}^n F(X^i_j, Z^i)$.
Then, using Corollary 8.8 of van de Geer [20] and the tail condition assumed in the
theorem, one can show $\sup_{g \in B(0,1)} |F_j g| = O_p(1)$. Let $C_{1j}$ denote the $j$th
element of $C_1$. Since
$P(\delta_n^{-1}|C_{1j}| > M) \le P(\sup_{g \in B(0,1)} |F_j g| > M)
+ P(\delta_n^{-1}\Delta \notin B(0,1))$, we obtain
$C_{1j} = O_p(\delta_n) = o_p(1)$. One can prove $C_2 = o_p(1)$ using a truncation argument with
Theorem 3 and applying the Chebyshev inequality conditioning on $(X^i, Z^i)$. The fact that
$C_3 = o_p(1)$ follows from $P(Z_j \text{ lies in } [0, ch_j) \cup (1 - ch_j, 1]) = O(h_j)$ for
some constant $0 < c < \infty$ and Theorem 3. □



Proof of Theorem 5. We will show that $\tilde\beta - \beta^*_n = o_p(n^{-1/2})$. It suffices to
show

$$\hat I^{-1}n^{-1}\sum_{i=1}^n \tilde X^i\hat\psi(\hat\varepsilon^i)
= \hat\beta - \beta^0 + I^{-1}n^{-1}\sum_{i=1}^n [X^i - \eta(Z^i)]\psi(\varepsilon^i)
+ o_p(n^{-1/2}). \qquad (6)$$

By Theorem 3 and standard techniques of kernel smoothing along with assumption B3,
it holds that, uniformly over $i$,

$$\hat\psi(\hat\varepsilon^i) = \hat\psi(\varepsilon^i)
- X^{iT}(\hat\beta - \beta^0)\hat\psi'(\varepsilon^i)
- \{\hat m^{\mathrm{add}}(Z^i; \hat\beta) - m^0(Z^i)\}\hat\psi'(\varepsilon^i)
+ o_p(n^{-1/2}). \qquad (7)$$

Also, using the proof of Lemma 4.1 in [2] and standard calculus, one can show
$\hat I = I + o_p(1)$ and
$n^{-1}\sum_{i=1}^n \tilde X^i\tilde X^{iT}\hat\psi'(\varepsilon^i) = -I + o_p(1)$. Thus, the
proof of the theorem is completed if we verify

$$n^{-1}\sum_{i=1}^n \tilde X^i\{\hat m^{\mathrm{add}}(Z^i; \beta^0) - m^0(Z^i)\}
\hat\psi'(\varepsilon^i) = o_p(n^{-1/2}); \qquad (8)$$

$$n^{-1}\sum_{i=1}^n \tilde X^i\hat\psi(\varepsilon^i)
- n^{-1}\sum_{i=1}^n \{X^i - \eta(Z^i)\}\psi(\varepsilon^i) = o_p(n^{-1/2}). \qquad (9)$$

Proofs of (8) and (9) can be based on the following lemma, which follows from Corollary
2.7.4 in [21] and assumption B2 on $L$. Note that the moment condition on $\varepsilon$ ensures
the entropy bound. To state the lemma, define

$$C^\alpha_M(\mathcal{X}) = \Big\{ f : \mathcal{X} \to \mathbb{R} :
\sup_x |f(x)| + \sup_{x \ne x'} \frac{|f(x) - f(x')|}{|x - x'|^\alpha} \le M \Big\}$$

for a set $\mathcal{X} \subset \mathbb{R}$ and a real number $\alpha \in (0,1]$. Let
$\|\cdot\|_g$ denote the $L_2$ norm with respect to the density $g$.

Lemma 1. Assume the conditions of Theorem 5. Then there exists a constant $M$ such
that, with probability tending to one, $b(a \wedge b)\hat\psi \in C^1_M(\mathbb{R})$,
$[nh_{\max}a^6 b/(\log n)^2]^{1/2}(\hat\psi - \psi_n) \in C^1_M(\mathbb{R})$ and
$b(a^2 \wedge b^2)\hat\psi' \in C^1_M(\mathbb{R})$. Moreover, there exist constants $\delta > 0$
and $C_1 > 0$ such that
$\log N_{[\cdot]}(\eta, C^1_M(\mathbb{R}), \|\cdot\|_g) \le C_1\eta^{-2+\delta}$. □